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Abstract. We present a short survey of the literature on indexing and 
retrieval of mathematical knowledge, with pointers to 72 papers and 
tentative taxonomies of both retrieval problems and recurring techniques. 


1 Purpose Driven Taxonomy of Retrieval Problems 


Retrieval of mathematical knowledge is always presented as the low hanging 
fruit of Mathematical Knowledge Management, and it has been addressed in 
several papers by people coming either from the formal methods or from the 
information retrieval community. The problem being resistant to classical content 
search techniques (LRG13J , it is usually addressed combining a small set of new 
ideas and techniques that are recurrent in the literature. Despite the amount of 
work, however, there is not a single solution that is the clearly winning on the 
others, nor convincing unbiased benchmarks to compare solutions. Some authors 
like |KK07j also suggest that the community should first better understand the 
actual needs of mathematicians from an unbiased perspective to improve the 
MKM technology as a whole. In this paper we collect a hopefully comprehensive 
bibliography, and we roughly classify the papers according to novel taxonomies 
both for the problems and the techniques employed. The only other surveys 
on the same topic are [AZ04] . now outdated and focused mostly on (European) 
research projects that contributed to the topic in the 6th Framework Programme, 
[ZB12] , which c overs less literature in much greater detail without attempting 
a classification, 
and 


L13 , which is focused on evaluation of mathematics retrieval, 


LiO , which is written in Slovak. 


We begin our discussion with a purpose driven taxonomy made of three dif¬ 
ferent retrieval problems that deal with mathematical knowledge. Each problem 
is characterised by its own set of expectations and constraints, and adopting a 
solution to another problem may be infeasible or yield poor results. In the next 
sections we classify the papers according to an encoding based taxonomy (pre¬ 
sentation vs content vs semantics) and to a taxonomy of techniques employed. 
Finally we point to the rich literature relative to the problem of ranking, and 
we touch the problem of evaluation of systems. We conclude with some notes on 
the availability of math retrieval systems. 


The final publication is available at http://link.springer.com 













Problem 1: Document Retrieval 


Objective: A human is interested in recalling a set of mathematical docu¬ 
ments (or fragments) that are related to a particular mathematical topic. Typ¬ 
ically it is not the case that only one document provides the correct answer; 
on the contrary the user may be interested in a corpora of different documents 
that yield different, only partially overlapping information. In |Kohl41 and other 
papers there are attempts at a classification of the information needs of users. 
However, at the moment only the system described in IZKT08j tries to use the 
classification to improve the user experience. 

Input: The human composes a query combining keywords (e.g. for topics 
[ ACK08I ). free text and mathematical formulae. Often the mathematical formu¬ 
lae are intended as examples of expressions related to the topic of interest. For 
example, a user interested in trigonometric identities can just enter one identity 
to retrieve them all. Or a formula showing a particular property of a special 
function can be used to disambiguate the special function among the ones with 
similar names. 

The query can be composed using a very simple, Google-inspired, single 
line interface, or written using an ad-hoc query language (see |AY07bl IAY08bl 
IYA07I for some proposals), or by filling in some form. The first solution is the 
one preferred in the literature. In | IKohl4| a comparison of the behaviour of 
mathematicians vs other users highlighted that the professional mathematician 
is more interested in the precision of the output than the effort put into the input. 
Therefore mathematicians may use and appreciate more complex interfaces. On 
the contrary, other users are likely to prefer a simple, modern search interface. 

Formulae can be entered in some textual syntax (e.g. DTj^X, MathML), 
maybe with the help of on-tlre-fly formulae rendering [LSR14] , or using graphical 
editors (MMOfij . or they can be acquired from hand-written snippets }AY08aj . 
The formula is likely to contain errors and ambiguities, for example if it is en¬ 
coded at the presentational level (e.g. in Presentation MathML or lATj^X), if it 
is acquired from hand-written text, or if the user only remembers it partially 
or in a wrong way. Errors and ambiguities are not a critical problem because 
formulae are just used to retrieve documents that contain similar formulae ac¬ 
cording to some similarity criterion. In [AY08a] the authors address the problem 
of combining and ranking results from different queries generated from ambigu¬ 
ous formulae due to errors in the recognition process. See also [ZB12] for a survey 
on the interaction between mathematical information retrieval and mathematical 
document recognition. Some authors IKT09I suggest that the visual presenta¬ 
tion of the formula may sometimes be important in the definition of similarity, 
whereas in other situations it is the mathematical content of the formula that 
matters. Logically equivalent formulae whose content encoding is highly different 
are better considered less similar. 

Once the search engine returns the result, the user may be given the oppor¬ 
tunity to enhance the query by further filtering. 



























Output: the output is a ranked list of matching documents or document 
fragments (e.g. a chapter of a book, or a section of an article). When the query 
involves mathematical formulae, the ranking is determined by the similarity re¬ 
lation. The user must be given the possibility to quickly determine whether the 
matched document is interesting or not. Therefore the problem of how to present 
summaries of the selected documents in the result list is of fundamental impor¬ 
tance ILSLM111ILSR141 IMGOSbl IWG10I lYouOSl lYouOfil 1You071 IYou08| . Even 
highlighting correctly the bits of the summary that matches the query can make 
a significant difference in the user experience [LSLMlll ILSR141 IYou051 IYou06] . 
The list of results must be the starting point for further investigations by the 
user. At least, all results must contain hyperlinks or other ways to retrieve 
the original document the summary points to. A study of user requirements 
in IZKT08: suggests that results should be presented after clustering them ac¬ 
cording to their resource type (research paper, tutorial, slides, course, book, 
etc.). For example, a student may immediately decide to skip research papers, 
and a researcher may skip websites and tutorials. 

Constraints: a balance must be obtained between precision (the fraction 
of retrieved documents that are relevant) and recall (the fraction of relevant 
documents that are retrieved). To maximise recall, precision is affected and many 
out of topic documents (false positives) are retrieved, penalising performance. 
Too many results are overwhelming and the user is likely to give attention only 
to the first ones in the list. Therefore, the search engine does not have to rank 
and produce summaries of documents with low scores. The ranking function is 
ultimately the one responsible for the perceived quality of the search engine. 

Since the query is intended to be issued by a human, the performance of 
the search engine is not a critical requirement and up to a few seconds (or even 
minutes in some particular situations) may be acceptable. Nevertheless, modern 
textual search engines like Google are extremely fast, and the user is likely to 
expect the queries to be solved in less than a second. 

Problem 2: Formula Retrieval 

Objective: A program — more rarely a human — is interested in retrieving 
all formulae that are in some relation 1Z with a query formula E. Sometimes 
the formula E can actually be a set of formulae. For example, E can be a 
goal to be proved automatically, and T 1Z E when T is the conclusion of the 
statement of a theorem that can be instantiated to prove E. More precisely, T 
contains metavariables to be instantiated and 7 Z is one-sided unification up to 
some equational theory. The dual query is also used in the literature: E is a 
property (a statement containing metavariables), 1Z is unification and the query 
finds all operations that satisfy the property E. By using several properties at 
once, the query can find all models of a given theory (e.g. all semirings in the 
library) }NK07) . also up to renaming of constants and properties. An interesting 
application presented in GK14) . that uses techniques similar to iNK07 i. con¬ 
sists in matching concepts across libraries by first computing properties of an 
object in one library (i.e. patterns like commutativity of a binary operator) and 




























then looking for objects in the other library that satisfy the same properties. 
To be more effective, properties are extracted from all libraries and concepts are 
matched according to a similarity measure to identify objects that satisfy a simi¬ 
lar set of properties. A third example is obtained by choosing logical implication 
for 1Z. The query looks for all formulae that imply E. 

Input: one or more formulae E that may or may not contain metavariables to 
be instantiated. Rarely, additional constraints can be expressed using keywords, 
classifications, free text, auth ors, etc . Formulae are not supposed to be ambigu¬ 
ous or contain errors. In |AGSC + 06] ambiguity is resolved before performing the 
query using type checking and interaction with the user. 

Some dedicated query languages are proposed to specify the structure of the 
formulae E [Ba.nOfilfTTR03lfTTHOil [CTSOTTlI kTToI IRa.bl 2| . They are implemented 
on top of relational databases or ad-hoc in-memory indexes. 

Output: the query is meant to retrieve a set of formulae that satisfy a certain 
property. When the search is performed by a program, there is no need to present 
summaries of the document the formula occurs in. Even when a human issued 
the search, an hyperlink to the document may be sufficient. 

In many situations the relation 1Z can be extended to a ternary relation 
T 7 Z p E meaning that T is related to E with score p , and the results can be 
ranked according to p. For example, if 1Z reduces a proof of £ to a proof of T, 
then the T’s may receive a higher score if they are judged easier to prove. 

Constraints: maximisation of the recall is fundamental. The query should 
return all formulae that satisfy the query, even if they rank very low. Be¬ 
cause searches are often basic operations of complex algorithms (e.g. automatic 
provers), speed is also critical. In several situations, the searches need to be 
performed in milliseconds. 

To speed up the searches or when the relation 7Z is undecidable, the search 
engine may use a second decidable relation 7 Z' such that 1ZQ1Z!. Using 7 Z', the 
query can return false positives, i.e. formulae T such that T 1Z' E but not 
T 1Z E. For example, when A is a pattern and 7Z is unification, 1Z! may ignore the 
structure of the two formulae E and T and conclude E 1Z! T when all symbols in 
E are also in T. Example: f(xz, y + z ) 7 Z /(?, ?+?) and f(y+z, xz) 1Z' /(?, ?+?), 
but not f(y + z, xz) 1Z /(?, ?+?). 

Problem 3: Document Synthesis 

Objective: Composing a new mathematical document assembling fragments 
to be retrieved from a library. The most common occurrence is in educational 
software where a learning object must be assembled according to the expertise 
of the user, the topic of interest, etc. BF03 > ! LDM + 08l ILM061 . An unrelated 
example is the automatic generation of summaries and statistics for a math¬ 
ematical library [BR03) . A final example is mining of formalised libraries, for 
example to build visual representations of the graph of dependencies over an 
axiom (to understand its implications) or a definition/statement (to understand 
the propagation of changes). 














A variant is to solve a mathematical problem by composing mathemati¬ 
cal (Web) services |CDT04I . Each service exposes metadata about the problem 
solved, the algorithm implemented, and its preconditions and postconditions. 

Input: the query is not likely to involve mathematical formulae, and it is 
usually expressed using a query language over ontologies. A high level interface 
may hide the underlying query language. Sometimes the query is fixed once and 
for all, and needs to be run at regular periods. 

Output: the expected output depends on the particular use case and it is 
usually made of a single result in place of a ranked list. The result may consist of 
a graph of objects and relations between them, or it may consist of the minimal 
information to build the expected document or solve the algorithmic problem. 

Constraints: the constraints depend on the particular use case. 

After an initial screening of the literature, we decided to analyse only papers 
about the first two problems, where formulae play a central role. Indeed, at a first 
glance most solutions to Document Synthesis employ standard query languages 
for ontologies, and only the ontologies themselves are math-specific (e.g. CDT04 
lLDM + 08] f. Logic programming languages are also employed to represent what 
the user knows/ignores and the inference rules to assemble documents [BF03 . 

Moreover, we did not find in the literature convincing examples for the need 
of very expressive query languages to solve the Document Retrieval and the 
Formula Retrieval problems, where the kinds of queries are essentially fixed a 
priori. Moreover, evaluation of queries expressed in these languages are reported 
to be too slow to be used for Formula Retrieval. Sometimes additional techniques 
are employed for Document Synthesis, like semantical query reduction to relax 
the user provided query by allowing additional topics close to the one specified 
by the user ILibl3j . These techniques too seem to be very general and applicable 
to domains very different from that of mathematics. 

Despite the strong interest of the community in the use of formulae in queries, 
studies on the behaviour of users (KK071 ; ZKT08] conclude that the added value 
may be low, and the finding is confirmed in | LM06l [Mill 3] where the logs of 
the DLMF and of ActiveMath search engines are analysed concluding that only 
few queries contain mathematical formulae, most are very simple ones, and such 
queries do not yield satisfactory results. 

2 Encoding Based Taxonomy 

Mathematical information can be encoded in a library at three different lev¬ 
els. The most shallow one is presentation. Presentation markup uses a Unitary 
language to express the bi-dimensional layout of a formula, useful to present 
it to the reader. The standard XML language for presentation is Presentation 
MathML , and several tools can generate Presentation MathML from DTjrpC(e.g. 
LaTeXML, Tralics), PDF files (e.g. Maxtract), handwritten text (e.g. InfTy 
Reader), digitised documents (e.g. InfTy Reader) or content markup (e.g. via 
XSLT stylesheets). We can therefore assume that the totality of the documents 
to be indexed are available in Presentation MathML. 


















The next level is content. At the content level, the structure of the formula 
is described, and symbols and operators appearing in it are linked to their en¬ 
try in an ontology, called content dictionary in OpenMath terminology. The 
markup language is Unitary, but the ontology is not since new mathematical 
entities can always be defined. The relation between content and presentation 
is one-to-many: the same presentation markup may represent different content 
expressions ( ambiguity ), and a content expression can be given different presen¬ 
tations according to the conventions of the community of readers, the language 
of the reader, but also for purely aesthetic reasons like constraints on the size 
of the formula. OpenMath and Content MathML are the two standard XML 
language for formulae at the content level. OMDoc is an attempt at standar¬ 
dising at the content level whole mathematical documents, comprising proofs. 
Content markup is currently mostly used for the exchange of formulae between 
systems, in particular CAS. There are no significant examples of large libraries of 
documents natively written using content markup. Nevertheless, there are tools 
like SnuggleTeX based on heuristics to semantically enrich (annotated) DTj^X 
documents or even MathML Presentation documents to content. 

The last level is semantics and it is specific to libraries of formalised math¬ 
ematical knowledge. The semantics level refines the content level by picking for 
every content level object one particular definition in a given logic. The defi¬ 
nition chosen embeds the object with additional properties, e.g. computational 
properties. For example, addition over natural numbers can be defined in the 
Calculus of Inductive Constructions as a non-computable ternary predicate in 
logic programming style, or as a recursive function on the first argument — such 
that 0 + x and x become logically indistinguishable — or as a recursive function 
on the second argument — so that 0 + x and x are not indistinguishable, but 
only provably equal. Interactive theorem provers often provides an XML dump 
of their internal semantics representation. 

Formula Retrieval is always formulated either on semantics markup or on 
content markup. Even when the semantic markup is available, it may be conve¬ 
nient to convert the library to content level by identifying alternative definitions 
of the same mathematical notion. In this way, it becomes possible to retrieve 
useful theorems on mathematically equivalent definitions, in the hope to reuse 
them after conversion to the definitions in use. One application of this technique 
is reuse of libraries across different systems based on the same logic. 

The Document Retrieval problem is formulated in a way that is agnostic 
of the encoding. However, the user is likely to enter formulae in the query us¬ 
ing a presentation language (mostly DT^X, even if MathML starts to be used 
(LSLMllj 1LSR141 IMG08b| b Some authors have provided evidence that preci¬ 
sion is improved when exploiting parallel markup, even when the content part is 
automatically generated from the presentation part [NKTA14] . See [MY08| for 
motivations against content/parallel markup and in favour of a more lightweight 
encoding of content information in Presentation MathML. Other authors claim 
that precision can be lost by embracing content because sometimes the actual 
layout used in a presentation or the name used for variables are significant. 












fGPBB14l in retrospect also described the choice of using Content MathML as 
a bad decision. Other authors dismiss indexing of Content MathML because 
of the non-availability of libraries or because of conversion from presentation 
to content being approximative and unreliable. Finally, |NKTA14l reports that 
automatic conversion of large formulae from Presentation to Content may be 
computationally unfeasible, and propose to limit the conversion to small ones. 

Recently, the debate on presentation only vs parallel markup seems to be 
solved in favour of the latter. For example, the system that scored better at the 
last NTCIR task reports better scores when applied to content markup generated 
from DTJ^X w.r.t. presentation only markup [RSL14j . The authors second this 
observation already in ILSR13] . 

Moreover, several works in the literature that deal with presentation markup 
enrich it - in the document itself or in the indexes — with additional annota¬ 
tions to make explicit additional semantics that is latent in the library |Cai04j 
or in the text surrounding the formulae IGPBB141IKTHA14I . For example, in 
[KTHA141 artificial intelligence is applied to the whole document to recover from 
the text surrounding the formulae the name associated to the mathematical en¬ 
tities in the formula (e.g. “posterior probability”, “derivative of /”). [GPBB141 
uses a cheaper approach by considering only one sentence around a formula, 
but it later observes that one sentence is often not sufficient and many relevant 
results are therefore missed. Another example is an analysis of co-occurrence of 
symbols in the corpus to identify related ones. It is shown that these techniques 
are important to augment recall or, sometimes, precision. In our view, like the 
heuristic based presentation-to-content translation, these are attempts to infer 
and store partial semantics of mathematical expressions. It may be questioned 
(see for example |MY08Q if the current content markups (OpenMath and Con¬ 
tent MathML) are the right instruments to augment presentation markup with 
partial, approximate semantics, and if such additional semantics makes only 
sense in the indexes of search systems, or it may be serialised to an XML format 
for being reused by third parties. 

Systems based on Content MathML, parallel markup or semantics appear 


in |Ban061 [BR03 ( BU04, GK14, HSliil IHKP141 lKPl3l lL13l 1LSLM111 ILSR131 


LSR14 ( MG08a, MG08b, MM06, NCH12, NT^7llBa.h12MSLM13llYAn9l[ZYT4] . 


3 Taxonomy of Techniques for Mathematical Retrieval 

Implementations of solutions to mathematical search problems can be obtained 
combining one of the main techniques that will be presented in Section 13.21 with 
a choice of modular enhancement techniques from Section 13.11 used to improve 
precision, recall or both. 


3.1 Modular Enhancement Techniques 

The following techniques are general enough to be applied to solve both the 
Document and the Formula Retrieval problems, and by analysing the literature 
it seems that every system eventually applies all of them. 














































Segmentation A preliminary step to indexing is segmentation of documents 
into chunks. Chunks are the unit of information to be returned to the user, with 
pointers to the parent document. Segmentation is trivial on formal mathematical 
documents, hard on web-pages, and intermediate on other resources like books or 
papers. See, for example, [ZKT081 for a discussion. Several systems implement 
segmentation; however, the last NTCIR competition has provided a data set 
of already segmented documents [AKQ14] and that may hinder the study of 
segmentation techniques in the future. 

Normalisation To improve recall, both formulae in the query and the formulae 
in the library are put in normal form before indexing them. Having the same 
normal form is an equivalence relation =, and the query retrieves formulae up 
to =. For Formula Retrieval it is necessary that =1Z= C 7 Z. For Document 
Retrieval the = relation must be compatible with the similarity and ranking 
functions. When this is not the case, precision can be critically lowered. 

Uses of normalisation include: repairing of broken XML/MathML gener¬ 
ated by automatic conversion tools [MM07 (e.g. when the structure imposed 
by <mrows> is not compatible with the mathematical structure); removal of 
information that does not contribute to the semantics like comments, layout 
elements (spaces, phantoms and linebreaks), XML/MathML attributes (color, 
font, elements in other namespaces) I I.HS12 HHN0811MM07) : picking canon¬ 
ical representations of the same presentation/content when different MathML 
encodings are possible (e.g. msubsup vs msup and msub, mfenced vs use of two 
parentheses, applications of trigonometric functions with/without using paren¬ 
theses, etc.) IAY07al fFLRS1211MM07] : replacing names of bound variables with 
unique numerical indexes (e.g. De Brujin indexes) to search up to a-conversion 
|MM071 NK071 ; ignoring parentheses and ordering of arguments of associative/- 
commutative operators |AY07al IMG08al IMG08bl INK071 ISY071 ISL111IYS06J : 
expressing derived notions exposing the derivation (e.g. replacing x > y with 
y < x, x $£ y with ~^(x < y), arcsin with sin -1 , etc.) )AY07a , MM07J : capturing 
logical equivalence/type isomorphisms (e.g. writing formulae in prenex normal 
form, currification of functions) [ DelOO , GK14 , NK07j . 

A normalised formula can be quite different from the original one, and that 
can be a symptom that the formula is not significant. Therefore in | RSL141 | 
normalised formulae are weighted according to their similarity to the initial one, 
and weights are considered during the ranking phase with great results. 

Approximation Normalisation does not lose information, converting a docu¬ 
ment to an equivalent one. Many papers call “normalisation” an approximation 
phase where subformulae are replaced with constrained placeholders to allow the 
formula to be matched by similarity. For example: names of variables or con¬ 
stants can be replaced by a single name IGWHT14] : all numeric constants by a 
single identifier (GWHTL411MM071 ISLllj ; subformulae may be replaced by their 
type. For example, in 111K1 J 111 type information is used to retrieve formulae by 
sorted unification, i.e. by constraints with type placeholders in patterns. Approx¬ 
imation improves recall. To limit the loss of precision, systems that approximate 












































index both the original and the approximated formula (or even several instances 
at different levels of approximation). The effects of approximation are similar to 
those of query reduction, but approximation is more efficient because it works 
at indexing time. 

Enrichment Enrichment works on the library or on the query to augment the 
information stored/looked for in the index by inferring new knowledge from exist¬ 
ing one. It can contribute to the solution of both the Document Retrieval and the 
Formula Retrieval problems. Typical examples of enrichment are: heuristically 
generating and storing content metadata from Presentation MathML IMY08j : 
automatic/interactive disambiguation of formulae in the queries to perform a 
precise query at the content or semantics level |AGSC + 06l IBan061 IBR031IBU04] ; 
automatic inference of metadata from context analysis or usage analysis (latent 
semantics) |Cai04[ IKTHA141IWG10] , 

The most impressive application of enrichment is presented in |HQ14| . The 
aim is to search for geometrical constructions that are described using a proce¬ 
dural language (e.g. draw the segments connecting A with R, B with C, and 
A with C). Enrichment consists in replacing the procedural with a declarative 
description (e.g. ABC is a triangle). The same declarative description can be 
obtained by multiple procedural ones, and thus recall is greatly improved. The 
technique can also be seen as a form of normalisation (see Section m where 
the normal form is not unique (e.g. it may be the case that by analysing the 
hypothesis one could deduce that ABC is also an equilateral triangle even if that 
is not stated in the procedural description). 

Query reduction Query reduction trades precision for recall by selectively 
dropping or weakening some of the constraints present in the query. Results 
obtained from reduced queries can be ranked after results from precise queries. In 
the literature it occurs in many forms in solutions to both the Formula Retrieval 
and the Document Retrieval problems: a constant can be weakened to other 
constants that co-occur frequently with the given one; constants that occur too 
frequently can be dropped from the queries; a formula may be required to match 
only the toplevel structure of the formula given as a query. 

3.2 Main Techniques 

The following techniques are mutually exclusive. Moreover, each technique per¬ 
forms better on only one of the two problems. 

Reduction to full-text searches The technology to perform full-text searches 
is very advanced and there are popular open software implementations with good 
performance like Apache Lucene/Solr and ElasticSearch. The benefits of reduc¬ 
ing search for formulae to full-text searches are speed of execution of the queries 
and the combination of formula based and textual searches almost for free. The 
main drawback is that the precise structure of a formula is partially lost in the 
translations proposed in the literature, and that it is impossible or very hard 










to capture precisely the kind of relations 1Z used for Formula Retrieval, unless 
1Z is approximated by a much coarser relation 1Z'. Therefore the technique has 
been successfully applied so far only to Document Retrieval 1ACK081IGWHT14I 
IGPBB14I IHHN081 IKTHA141 ILM061 ILSLM111 ILSR141 IMill3l IMY03I IMM071 
IMGOSal iMGOBbl lPZl4l iMMOfil ISLTTl lYouObl lYoffiT7l I You08| . 

All the proposals employ vectors to represent features, and compare features 
with weighted cosine distance. The usual approach consist in turning an expres¬ 
sion into a (large) set of “sentences” that partially describe the formula. For 
example, in [KTHA141 [TKNA13] a sentence is the set (ordered or not) of sym¬ 
bols found in either a path from the root of the formula to a leaf, or as children 
of the same node. Matching is then performed by a disjunctive query and results 
are ranked using TF-IDF and length normalisation. As the authors claim, the 
system “is too flexible: it is difficult to say where the relevant results stop and 
random matches begin; thus we predict higher recall but lower precision rates 
than exact match systems”. Other authors extract sentences or n-grams that 
capture the formula more precisely. As a general remark, the clear impression 
we got from the literature is that the fewer features extracted, the lower the 
precision. All kinds of techniques can be used to extract the features, comprising 
regular expressions IACK08I and finite state automata [NCH121 . 

Some systems cluster documents at indexing time (e.g. [ACK08j ). and re¬ 
trieve documents comparing the feature vector of the query with the centroid 
of the cluster. For example, documents about trigonometric functions are likely 
to be automatically clustered together. However most systems do not seem to 
cluster in advance, and prefer the flexibility of weights to capture similarity of 
features (e.g. similarity of occurrences of trigonometric functions). 

According to the set of features extracted, the weighting function used, and 
the other modular techniques used in combination, the accuracy achieved by 
systems based on this technique range from extremely low to extremely high 
(see, for example, [AKQ14] ). 

Structure-Based Indexing via Tries/Substitution Trees Formula Re¬ 
trieval can be solved with the data structures developed for automatic theorem 
proving to store libraries of lemmas and quickly retrieve formulae up to instanti¬ 
ation/generalisation. Pointers to all the statements are stored in the leaves of a 
tree that precisely encodes in its paths the statements. To match a formula, the 
tree is recursively traversed using the formula to drive the descent. The relations 
7 Z that can be captured are only instantiation and generalisation of whole for¬ 
mulae. MathWebSearch [HKP141IKP13I and iKTIOl are based on this approach. 
Retrieval of formulae is very fast, assuming that the index can be entirely stored 
in main memory. 

This approach consistently maximises precision but presents poor recall. To 
accept larger relations, or to be applied to Document Retrieval, or to cope with 
too rigid queries, the technique needs to be integrated with other ideas. For 
example: to match subtrees of formulae in the library, every subtree of a lemma 
needs to be stored as well in the index; to solve unification problems up to 
an equational theory that admits normal forms, all formulae are normalised; 






































to allow queries that use keywords or free text, a free-text search engine must 
be run in parallel and the results need to be combined in the ranking phase 
[HKP141 IlDM+08] . 

Reduction to SQL or ad-hoc queries The third approach consists in ap¬ 
proximating formulae via relations to be stored in a relational DB 1AS04[ 1GS03) . 
An alternative consists in storing the relations in ad-hoc indexes in memory, and 
it is employed when the indexes already exists for other purposes (typically in li¬ 
braries of formalised knowledge) IBan0611BR031IBU04I . The technique is applied 
to Formula Retrieval and the database can be reused for Document Synthesis 
without modifications. Approximated queries up to generalisation/instantiation 
can be made efficient |AS04| without requiring an index stored in main memory 
for Structure-Based Indexing (see page HOD . Recall can be maximised by relax¬ 
ing the representation of formulae as relations or by employing normalisation. 
Ad-hoc inverted indices for paths and to map each Content MathML node to 
its parent have also been used in |HS13j : the search engine is very fast, but the 
precision obtained is low. 

Reduction to XML-based searches Some systems |AY07bl AYOBbi YA07j 
that index MathML documents at the content level, base their searching ca¬ 
pabilities on the existing XPath/XQuery technology. The system described in 
jSLM13j . which is based on Stratosphere, is batch oriented, trades flexibility with 
performance, and it is essentially math-unaware (for example, it does not nor¬ 
malise the input in any way). Other systems 1CDT041 lLDM + 08j , that deal with 
ontologies indexed in the Ontology Web Language, rely on third-party OWL 
search engines implementing graph matching. 

4 Ranking 

Because users only inspect the first results returned by a query, precision when 
solving Document Retrieval is strongly determined by the ranking function. 
Ranking is also of paramount importance for Formula Retrieval: when the search 
retrieves the candidates for progressing in a proof, correctly ranking the results 
may dramatically cut the number of wrong proof attempts and backtracks. The 
ranking criterion for the two Problems is, however, very different: for the first 
problem similarity of formulae in the query and in the results should contribute 
significantly to the score; for the second problem the score should be determined 
by the intended use of the results. For example, a lemma L\ that exactly matches 
the goal to prove and has no premise should always score better than a lemma 
L 2 that also exactly matches the goal, but that has hypotheses to be proved 
later. 

Ranking according to the intended use for Formula Retrieval has received 
very little interest in the literature we examined. On the other hand, several 
papers explicitly address ranking for Document Retrieval. The consensus seem 
to be that a good ranking function needs to be sophisticated and that the usual 





















metrics induced by reduction to textual searches are completely inadequate (see, 
for example, |You07n . All the proposed ranking techniques are strongly based 
on heuristics and, unfortunately, most of them are incomparable and hard to 
combine. 

One class of metrics takes into consideration also the structure of the formu¬ 
lae involved and the enriched semantics, when available. For example, [ZY141 
heavily exploits Content MathML to rank results by considering the taxonomic 
distance of constants (where close is approximated to being defined in the same 
content dictionary), the data type hierarchical level (matching a function is 
more significant than matching a numerical constant), matching depth (par¬ 
tially matching the formula at the top level is more significant than matching 
a deeply nested subformula), coverage (percentage of formula matched), kind of 
matched expression (formula vs term). All this information needs to be com¬ 
puted and amal gamated employing some kind of heuristic algorithm. A second 
paper ISYM + 14l confirmed that each one of the listed similarity feature factors 
significantly improves the ranking, but the last one, that still contributes, has 
lower relevance. 

In [SLlTI ranking is determined by the weights used during matching, and 
the authors claim that each document base and scientific field should have its 
own weighting function. Nevertheless, they “tried to create a complex and robust 
weighting function that would be appropriate to many fields”. 

In IYou07j the author proposes a parameterised ranking function that works 
on mathematical documents (not only formulae), that seems applicable to en¬ 
riched presentation and that weights a lot of additional information including 
keywords, the number of cross-references and their kind (e.g. definitional vs 
propositional). Ranking employs a hybrid of scalarisation and vectorisation. 

In IKT131 the authors propose to adopt tree edit distance to measure similar¬ 
ity of formulae. Most of the paper is about optimisations to improve efficiency 
of ranking because tree edit distance is hard to compute. The final proposal 
combines some clever memoisation and a procedure to quickly prune documents 
bounding their similarity scores with a lightweight computation. The paper also 
shows benchmarks comparing the processing time and success rate of most search 
and ranking algorithm in the literature, reimplemented by the authors and run 
on the same dataset. From the benchmarks the method proposed seems to be 
superior, but the implementations do not exploit relevant enriched information 
like cross-references, semantic proximity of definitions, etc. The benchmarks are 
therefore non conclusive. 

In [NCH12] the authors employ a continuous learning ranking model after 
having extracted features from Content MathML mathematical formulae using 
a finite state automata. Benchmarks show their ranking to be superior than 
the ones used in classical ranking of textual documents. However, they do not 
compare with IKT131 or You07 (that work on Presentation MathML). 

Simpler approaches to ranking can be found in |YA09| (based on Subpath 
Set, reported to work well only on “simplified” Content MathML) and |KT09] 
that works on Presentation MathML and measures similarity as a function of 
























the size of subtrees in common. The ranking metric used in [ACK08] is a TF- 
IDF modified with weights to assign more importance to some operators, but 
the details given to determine the weights are insufficient. 

Ranking algorithms can be too complex to be incorporated in the search 
phase, for example when using Lucene technology. Moreover, they are typically 
slower than the search phase. Therefore several authors suggest to re-rank only 
the first results of the query, that employs a simpler ranking measure to deter¬ 
mine the interesting candidates to be ranked more accurately |KT131 You07l . 

An algorithm to automatically categorise documents is presented in [ZKT08I , 
where it is argued that clustering documents according to their category greatly 
improves the usefulness of the tool for the user. 


5 Evaluation of Math Information Retrieval 


Several papers present benchmarks on the systems proposed, and rarely com¬ 
pare them with reimplementations of the algorithms found in the literature (e.g. 
(KT13n . The significance of most of these benchmarks is unclear, because con¬ 
flicting results are found in the literature, most techniques are not presented in 
sufficient details in the papers to be exactly reproduced, and systems are very 
sensitive to the kind of queries examined. The only alternative is to compare 
different tools on unbiased, standard benchmarks that are currently lacking. 

The main issue is not to come up with large corpora of documents: at least 
for Document Retrieval on enriched Presentation MathML documents, a large 
corpus can be easily obtained converting documents from ArXiV, DLMF, Plan- 
etMath, Wikipedia, etc. For Formula Retrieval, the existing libraries of interac¬ 
tive theorem provers, like Mizar and Coq, can be directly used after conversion. 
The problem is to determine large sets of real world, interesting queries , and to 
evaluate the results. Automatic evaluation is particularly hard in the domain of 
mathematics, whereas manual evaluation is limited to a tiny number of queries 
and runs. Formulating good sets of queries is also complex, because users with 
different mathematical background and motivations are likely to issue different 
queries. Moreover, what makes a query hard can just be the use of non standard 
mathematical notations, errors in the encoding of formulae, or formulation at 
the wrong level of abstraction. L 131 discusses the problem at length and re¬ 


views the state of the art of evaluation of Math Information Retrieval before 
2013, including the experience of the MIR workshop at CICM 2012 were two 
systems were compared on about ten hard queries proposed by the judges, and 
the conclusion was that the systems were too sensitive to the formulation of the 
query. 

The situation is improving since 2013 with the creation of a math oriented 
task in the NTCIR initiative [AKO13, 1AKQ141 that is attracting a small, but 
increasing number of participants [GWHTMl IGPBB141IHS131IHKP141IKP131 

IKTHA14llLRGl3llLAP+l4llLSR13llPZT4llBSL141ISLM131ISYM+14HTKNAT3l . 


The initiative is too young to come to definite conclusions and the current choice 
of tasks and queries is not granted yet to have significant coverage and to be 



























unbiased. For example, in IKTH A14] the authors report that despite several 
improvements to the tool (quantified via NTCIR-11 runs), their tool scored lower 
than in NTCIR-10. They justify the phenomenon by noticing that “in NTCIR- 
11, query variables get much bigger emphasis, most topics feature complete and 
very particular formulae, and sub-formulae matching is not nearly as useful as 
before”. Indeed, as reported in [AKQ14 “the design decision ... to exclusively 
concentrate on formula/keyword queries and use paragraphs as retrieval units 
... has also focused research away from questions like result presentation and user 
interaction. ... few of the systems has invested into further semantics extraction 
from the data set. .. .We feel that this direction should be addressed more in 
future challenges”. An effect of the bias towards search up to unification w.r.t. 
search up to similarity is observable in [PZ14I too: the system proposed works 
very well even if it works on Presentation MathML only and the set of features 
extracted is very simple (bag-of-symbol-pairs model, where a pair is made of 
two symbols in a father-son relation). The reason why it works well is that the 
authors also index approximated pairs where the child is a wildcard, and in the 
future works they are thinking at improving even more the handling of wildcards 
to score better. In comparison, most other systems based on feature vectors just 
replace wildcards in the query with <m: ci> identifiers. The emphasis on formula 
queries is also to be evaluated considering the already cited works that conclude 
that users do not see (yet?) much value in them 1KK07I |ZKT08| . 

Finally, the NTCIR task does not cover distinctly the Document Retrieval 
and the Formula Retrieval problems, but only Document Retrieval with an em¬ 
phasis on exact pattern matching of formulae that should be more distinctive of 
Formula Retrieval. 


6 Availability of Math Retrieval Systems 

Most of the systems described in the literature are research prototypes, and the 
majority of them are no longer working or no longer accessible. At the time 
this paper was written, the only ones for Document Retrieval with a running 
Web interface or code that can be downloaded are: 1) Design Science’s Math- 
Dex (formerly MathFind) 2) NIST DMLF 3) MathWebSearch 4) MiAS (Math 
Indexer and Searcher), also used to search the EuDML 5) the system described 
in [PZ14] . In addition to those, the following commercial systems are also acces¬ 
sible: a) Springer LaTeXSearch b) Wolfram Alpha. Systems a), 1), 2) and 5) are 
based on Presentation MathML; systems 3) and 4) can use either Presentation 
MathML or Content MathML/parallel markup, but work better on the latter; 
finally system b) actually generates on the fly most of the result of the query, for 
example by plotting functions, computing their Taylor expansion, etc. It does 
not really qualify then as a search engine. 

Most interactive theorem provers also have their own implementation of a 
search engine to solve Formula Retrieval. Most of the time, the implementation 
is embedded in the system and does not work on the whole library at once, with 
the exception of MML Query for Mizar. 












7 Conclusions 


Mathematical knowledge retrieval, the low hanging fruit of Mathematical Knowl¬ 
edge Management, is still far from being grasped. Despite the significant amount 
of work dedicated to the topic in the last 12 years, only a few systems are still 
available, and their precision and recall scores compared to other knowledge re¬ 
trieval fields are low. Moreover, usability and user requirement studies suggest 
that queries containing formulae — the main focus of the majority of papers — 
are perceived by users as not very useful (yet?). 

The main contributions of this paper have been providing an hopefully com¬ 
prehensive bibliography on the subject, and presenting taxonomies for both 
mathematical retrieval problems and techniques. We believe that our purpose 
driven taxonomy can be useful in classifying papers, in clarifying the scope of ap¬ 
plication of techniques and in the much needed development of unbiased bench¬ 
marks for mathematical retrieval. 
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